Identification of monotonically expressed long non-coding RNA signatures for breast cancer using variational autoencoders

As breast cancer is a multistage progression disease resulting from a genetic sequence of mutations, understanding the genes whose expression values increase or decrease monotonically across pathologic stages can provide insightful clues about how breast cancer initiates and advances. Utilizing variational autoencoder (VAE) networks in conjunction with traditional statistical testing, we successfully ascertain long non-coding RNAs (lncRNAs) that exhibit monotonically differential expression values in breast cancer. Subsequently, we validate that the identified lncRNAs really present monotonically changed patterns. The proposed procedure identified 248 monotonically decreasing expressed and 115 increasing expressed lncRNAs. They correspond to a total of 65 and 33 genes respectively, which possess unique known gene symbols. Some of them are associated with breast cancer, as suggested by previous studies. Furthermore, enriched pathways by the target mRNAs of these identified lncRNAs include the Wnt signaling pathway, human papillomavirus (HPV) infection, and Rap 1 signaling pathway, which have been shown to play crucial roles in the initiation and development of breast cancer. Additionally, we trained a VAE model using the entire dataset. To assess the effectiveness of the identified lncRNAs, a microarray dataset was employed as the test set. The results obtained from this evaluation were deemed satisfactory. In conclusion, further experimental validation of these lncRNAs with a large-sized study is warranted, and the proposed procedure is highly recommended.


Introduction
Breast cancer is the most commonly diagnosed cancer and the most frequent cause of cancer death among women worldwide [1]. Since cancer is a multistage progression process resulting from genetic sequence mutations [2], the genes whose expression values increase or decrease monotonically across stages are expected to play essential roles in the tumor progression and metastasis. The identification of these monotonically differentially expressed genes (MEGs) offers valuable insights into the progression of cancer. Several studies have investigated MEGs across stages for a variety of cancer types, such as lung cancer [3,4], colon cancer [4,5], and liver cancer [6]. Certainly, breast cancer is also included [7]. Long non-coding RNAs (lncRNAs) are post-transcriptional and epigenetic regulators, with usually lower and more tissue-specific expression levels compared to protein-coding genes. Once regarded as evolutionary junk, lncRNAs are now known to play essential roles in complex diseases, including cancer [8]. In respect of breast cancer, deregulation of lncRNAs, such as XIST [9,10], H19 [11], and HOTAIR [12,13], has been reported in tumor tissues or cell lines. A survey of the PubMed database using the keywords "lncRNA" and specific cancer types (as depicted in Fig 1) revealed that breast cancer ranks as the second most frequently studied cancer type in relation to lncRNAs. This observation highlights the significance of exploring the pivotal role of lncRNAs in breast cancer further.
To our knowledge, no prior study has focused specifically on lncRNAs whose expression values changed monotonically across stages of breast cancer, especially using deep learning methods. Deep learning methods, which are neural networks that comprise multiple layers with non-linear activation functions, have gained immense popularity in many fields, including life science and medicine [14]. Nevertheless, it is worth pointing out that the development and utilization of deep learning methods in the biomedical research face two key obstacles [15,16], namely the curse of dimensionality and the lack of a good biological interpretation. Deep learning techniques are predominantly employed in the analysis of single cell RNA-sequencing (scRNA-seq) data, which can be considered as "big-big" data due to their large number of cells (n) and features (p). As a result, the problem of over-fitting issue is relatively less severe in scRNA-seq data compared to bulk RNA-seq data and microarray data. The cancer genome atlas (TCGA) breast cancer cohort had enrolled~1,000 subjects that is relatively large compared to other cohorts, alleviating the curse of dimensionality to a certain extent. Due to this reason, as well as to showcase how to identify and verify MEGs with the aid of a deep learning method, the TCGA breast cancer was chosen.
Furthermore, in contrast to other applications such as computer vision (CV) and neural language processing (NLP), a good biological interpretation is a must for the biomedical research. The "black box" operation of the deep learning methods blurred an explicit interpretation, and thus impeded their widespread utilization in the circles of biology and medicine. Nevertheless, reconstructed gene expression representative values using autoencoder frameworks have previously been shown to reveal novel biological patterns [17,18]. As anticipated, autoencoder frameworks are expected to unveil monotonically changing patterns as the disease advances. A conventional autoencoder framework comprises two essential components: the encoder and the decoder. First, the data were compressed into a low-dimensional vector representation using the encoder. Next, the low-dimensional vector was uncompressed by the decoder to obtain the reconstructed data to minimize the reconstruction error (i.e., the difference between the original data and the reconstructed data), so the reconstructed data could most closely mimic the original data. Several review articles [19] showed an autoencoder cannot generate new samples, which are not in the original data.
Currently, a variational autoencoder (VAE) model is among the commonest used deep learning methods for modeling gene expression data, taking the work by [20,21] as an example. Compared to an autoencoder, a VAE is capable of generating representative features that capture the underlying structures of the data, instead of limiting to the original data and "memorizing" them [19]. This task is accomplished by mapping the data into a distribution with a low dimension. The structures being investigated can encompass various phenotypes, cell types, and other characteristics specific to the samples. Therefore, the VAE framework is anticipated to be capable of detecting the specific structure of interest in this study, namely the monotonically changing patterns in expression values as the breast cancer progresses.
In this article, we used the representative expression values learned from VAE models to parameterize monotonically increasing and decreasing lncRNAs, respectively, across pathologic stages of breast cancer and thus justified that the expression levels of identified lncRNAs changed monotonically.

Experimental data
The gene expression profiles and corresponding clinical information of the breast cancer (BRCA) cohort from TCGA were obtained from the TCGA's Genomic Data Commons (http://www.cbioportal.org/). The cohort comprised 140 patients at stage I, 480 patients at stage II, 180 patients at stage III, and 105 para-carcinoma tissues serving as normal controls, which were included for the downstream analysis. Of note, the patients at stage IV were excluded due to the limited number of individuals (only 14 patients) in this advanced stage within the TCGA BRCA cohort.
The annotation of lncRNAs were retrieved from the TANRIC [22] database, which included 12,727 unique lncRNAs gene symbols. The gene expression profiles were quantified as the fragments per kilo-bases per million (FPKM) counts, then these expression values were subsequently standardized to have means of 0's and variances of 1's. For validation purposes, a microarray dataset with 121 individuals included was considered. The expression matrix for this dataset was directly downloaded from the gene expression omnibus (GEO) database under the accession number of GSE42568 (https://ncbi.nlm.nih.gov/gds/?term=GSE42568).

Identification of monotonically differentially expressed genes
In the proposed procedure of identifying monotonically differentially expressed genes, moderate t-tests (implemented by the R limma package) were conducted to screen out the differentially expressed genes (DEGs); the multiple testing issue was addressed by adjusting p-values with the Benjamini-Hochberg method [23].
As different change patterns, such as "U-shaped", "waved", and monotonic patterns, were involved in those DEGs, the following constraints were added in order to extract out the desired monotonic patterns. To investigate the scenario of monotonically decreasing gene expression, we focused on the intersection of three sets of under-expressed DEGs: stage I versus normal, stage II versus normal, stage III versus normal. The assessment of fold change at the logarithmic scale (LFC) was performed. The corresponding restriction is LFC I > LFC II > LFC III . In contrast, for the monotonically increasing scenario, LFC I < LFC II < LFC III was additionally required and hence added upon the intersection of over-expressed DEG sets. Since patterns other than monotonic ones are not our focus areas, which may be regarded as "noises", the inclusion of other patterns would blur the true signals of interest. Therefore, the addition of proposed restrictions should help discard these noises.

Implementation of VAE models to validate MEGs
A VAE employs a similar strategy to an AE; the only difference between VAE and AE is that in the VAE model, each data point is represented by a set of latent variables that can be decoded by neural networks to produce parameters for a probability distribution (usually a Gaussian distribution). Hence, VAE defines a generative model. In this study, we used a VAE model to justify the identified MEGs in the previous subsection present monotonically changed patterns across stages, indeed.
This procedure is based on the premise that, for each stage, the pattern of all monotonically increasing changed genes (with similar or identical underlying patterns) can be captured by the corresponding VAE model's underlying distribution. The same rationale is applicable to monotonically decreasing genes as well. Then if the corresponding means of these distributions over stages present monotonic changes (and are statistically significant since we also have the variances of these distributions, hypothesis tests could be carried out. Here, the underlying null hypothesis is μ normal = μ stage I = μ stage II = μ stage III ), the identified lncRNAs are certainly monotonic.
The encoders of VAE considered in the current study comprised of the following layers: the first dense network connects the input layer to the first hidden layer, the number of nodes is 32, and the second network with 8 nodes. Rectified Linear Unit (ReLU) is the activation functions used in the dense network. Thereafter a hidden layer is connected to the bottleneck layer, which consists of one network for the mean and another network for the variance on the logarithmic scale of the hidden distribution.
The decoder of VAE comprised of the layers in the inverse order, namely, the first full-connected layer connects the bottleneck layer to an 8-node hidden layer, and the second full-connected layer with 32 nodes, and the output layer decoding back the dimension of original data without an activation function. The Adam method was used as the optimizer, and default settings for the learning rate and other hyperparameters were employed. All VAE modeling was stopped at 20 epochs, which have shown that mean squared error (MSE) metrics in the VAE models converge. Fig 2 depicts the flowchart of this proposed procedure.

Biological relevance and enrichment analysis
For the identified monotonic genes, their ensemble gene identifiers were converted to The HUGO Gene Nomenclature Committee (HGNC) approved gene symbols using the GEN-CODE (https://www.gencodegenes.org/) database (version 38). The GeneCards (https://www. genecards.org) and PubMed databases were searched to survey for the biological relevance of identified monotonically changed lncRNAs with unique gene symbols.
The target mRNAs by the identified lncRNAs were retrieved from the lncRNA Disease 2.0 database [24]. The String software (www.string-db.org) was used to perform pathway enrichment analysis to identify the pathways/gene sets enriched by the target mRNAs.

Statistical language and packages
R language version 4.0 (www.r-project.org) was used to perform data analyses in this study. To construct VAE models, the Keras package with a TensorFlow backend (with a call on Python language, version 3.7) was employed. Additionally, the limma package was utilized for fitting moderated t-tests.

Results and discussion
First, the DEGs were identified using moderated t-tests. Setting the cutoff value of the false discovery rate (FDR) at 0.01, for the first comparison in which the patients at stage I and the normal controls were considered, 2,629 lncRNAs were identified as down-regulated DEGs and 838 lncRNAs as up-regulated DEGs, respectively. For the second comparison (stage II versus control), there were 2,966 and 1,205 DEGs, while for the last comparison (stage III versus control), the numbers were 2,970 and 887, respectively. Interestingly, the number of down-regulated DEGs was found to be approximately two times higher than the number of up-regulated DEGs. Then we put additional restrictions (as suggested in the Method section) in order to discard other patterns rather than the monotonic ones. This resulted in 248 monotonically decreasing DEGs and 115 monotonically increasing DEGs, which were input into the VAE models for further justification.
The expression profiles of these 248 monotonically decreasing DEGs and 115 monotonically increasing DEGs were fed into the VAE networks (4 VAE models for the decreasing scenarios, including respective individual ones for normal, stage I, stage II and stage III, and the other 4 VAE models for the increasing scenarios, respectively). The representative parameters, namely the mean and variance of underlying latent distribution were extracted. Subsequently, hypothesis testing was performed based on these parameters. Fig 3 shows that except for the comparison between stage II and stage I in the down-regulated case (p = 0.887), the statistically significant difference between the pairs in the other comparisons was evidence. To conclude, the monotonically changed patterns of identified lncRNAs were obvious.
After transferring the annotation of identified lncRNAs to HUGO gene symbols, the 248 down-regulated and 115 up-regulated MEGs corresponded to 65 and 33 unique lncRNAs, respectively. Out of the 98 identified unique monotonically differentially expressed lncRNAs, 9 were found to be unannotated in the GeneCards database. Among the remaining lncRNAs, For the remaining lncRNAs, 16 were directly associated with breast cancer, meaning previous studies have established a link between these identified gene and breast cancer. Additionally, 44 lncRNAs were indirectly associated with breast cancer, as previous studies have demonstrated that the genes sharing the same biological pathways as the identified genes are associated with breast cancer. In total, approximately 60% of the identified lncRNAs were found to be associated with breast cancer based on the information from the GeneCards database. For instance, using sequencing technique (sequenced on seven pairs of tumor and normal tissues) and bioinformatics methods, LINC00636 was identified as a hub gene in the co-expression network constructed for HER-2-enriched subtype breast cancer [25]. Furthermore, using realtime quantitative polymerase chain reaction (qRT-PCR), MIAT was experimentally validated as being over-expressed in breast cancer patients compared to normal controls [26]. LINC01016 was also experimental validated to associate with breast cancer, saying that this gene's silencing results in reduced proliferation, and it has a prognostic value on the survival time of breast cancer patients [27]. The regulation direction of MIAT and LINC01016 is in consistent with our results.
Additionally, we explored the potential biological relevance of the target mRNAs by these lncRNAs. The target mRNAs by these 98 lncRNAs were retrieved from the lncRNA Disease 2.0 database. Then the enriched KEGG pathways and GO terms of target mRNAs were obtained using the String software [28] and presented in Tables 1 & 2, one corresponding to the increasing scenario and the other to the decreasing scenario. Of the enriched Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways, several pathways such as HIF-1 signaling pathway [29,30], Wnt signaling pathway [31][32][33], human papillomavirus infection [34,35], and Rap 1 signaling pathway [36] have been demonstrated to associate with breast cancer previously. On the other hand, of the enriched GO terms, focal adhesion [37] is relevant to breast cancer, as suggested by the literature. In summary, at the level of pathways, a good biological implication of identified lncRNAs to breast cancer is harnessed.
Finally, to assess the discriminative ability of the identified lncRNAs, gene expression values from the GSE42568 microarray dataset were utilized for validation purposes. This dataset included 17 normal controls, 11 patients at stage I, 40 patients at stage II, and 53 patients at stage III. As the gene annotations differed between the two platforms, we opted to fit an overall VAE model based on the common annotated 40 lncRNAs present in both platforms. The hyperparameters of this overall VAE network were set be identical to those of the separate VAE networks. Overall, 28 patients at different stages were misclassified regarding their histologic stages, resulting in an error rate of 23.14%. Notably, neither normal controls nor breast cancer patients have been misclassified, indicating that the performance of these MEGs is satisfactory.
We acknowledge the present study has several limitations. First, it is well known that breast cancer is a very heterogeneous disease, and different subtypes have district underlying molecular mechanisms and prognoses. The subgroup analyses were not considered in this study given that with a further stratification on subtypes would increase the likelihood of over-fitting dramatically (resulting from the fact that the number of samples in each stratum would drop substantially). Furthermore, it is worth noting that patients at stage IV were excluded in this study due to the limited number of patients at this specific stage (only 14) within the TCGA BRCA cohort. While understanding the progression of breast cancer to this advanced stage is of significant interest, a large-scale study is highly desirable. Additionally, the deep learning model considered in this study, specifically a VAE model with 7 layers can be considered relatively shallow in terms of its complexity. Given that the sample size for each stratum ranges approximately from 100 to 500, such sizes of discovery data are insignificant to train a deep learning model enclosed with a remarkably large number of hidden layers.

Conclusions
A significant contribution of this study is the ingenious combination of a VAE network, which generates a representation feature representing all genes with a specific change pattern with the classical approach of statistical hypothesis testing. This integrated analysis enables the identification and subsequent validation of monotonically changing genes across different stages of breast cancer. Thus, it enlightens us on a potential stretch for the application of deep learning methods.
Our analysis results show that the identified lncRNAs are indeed changed monotonically across stages and have good biological implications in breast cancer. Experimental validation of these lncRNAs with a large-scale study is warranted in the future. In conclusion, the proposed procedure is highly recommended.